{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6692f22d-a9b1-4ebe-b4d4-d8f141ba36c4",
   "metadata": {},
   "source": [
    "## 赛事背景\n",
    "随着互联网内容的快速增长，网络平台面临日益严峻的违禁信息治理挑战。违禁词（如涉政敏感、色情低俗、暴力犯罪、宗教迷信等）通过谐音、缩写、符号插入、多语言混合等动态变体形式，持续规避传统检测规则，对网络生态安全和用户体验造成严重威胁。现有基于关键词匹配或简单规则的方法难以应对复杂语境下的语义歧义和对抗性干扰，亟需通过人工智能技术提升违禁内容识别的智能化水平。\n",
    "\n",
    "## 赛事任务\n",
    "本次大赛要求参赛者构建高效的机器学习或深度学习模型，用于准确识别文本中违禁词汇的类别。为此，提供经过脱敏处理的大规模真实网络文本数据集，包括但不限于社交媒体帖子、新闻评论等。\n",
    "\n",
    "## 评审规则\n",
    "1. 数据说明\n",
    "\n",
    "本次比赛为参赛选手提供的数据包括从各大社交平台采集的用户发言，数据包含主要文本以及对应的违禁分类，以便选手利用。\n",
    "\n",
    "2. 评估指标\n",
    "\n",
    "本模型依据提交的结果文件，采用macro F1-score进行评价。"
   ]
  },
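  {
   "cell_type": "markdown",
   "id": "1f7a3b2c-8d41-4e92-a5c3-0d1e2f3a4b01",
   "metadata": {},
   "source": [
    "Macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A minimal illustrative sketch of the metric (the toy labels below are made up):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f7a3b2c-8d41-4e92-a5c3-0d1e2f3a4b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# Toy example: always predicting the majority class still scores poorly,\n",
    "# because the missed rare class contributes an F1 of 0 to the average.\n",
    "y_true = [\"a\", \"a\", \"a\", \"b\"]\n",
    "y_pred = [\"a\", \"a\", \"a\", \"a\"]\n",
    "f1_score(y_true, y_pred, average=\"macro\")  # (0.857 + 0.0) / 2 ≈ 0.43"
   ]
  },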
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a8b893bb-1254-4806-90da-78b1da40833e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import jieba\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.model_selection import cross_val_predict\n",
    "from sklearn.metrics import f1_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f289065-d028-4432-8833-fa12bf06d5bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "train = pd.read_csv(\"dataset/train_all.csv\")\n",
    "test = pd.read_csv(\"dataset/test_text.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "2ffd665f-d6cd-43b1-af54-ade8e99281c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "category_list = list(train[\"类别\"].unique())"
   ]
  },
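  {
   "cell_type": "markdown",
   "id": "2e8b4c3d-9e52-4fa3-b6d4-1e2f3a4b5c03",
   "metadata": {},
   "source": [
    "Because the metric is macro F1, class balance matters. An optional quick look at the label distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e8b4c3d-9e52-4fa3-b6d4-1e2f3a4b5c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: inspect class balance; rare classes dominate the macro F1 risk\n",
    "train[\"类别\"].value_counts()"
   ]
  },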
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "adc77ca4-5c6c-45ae-98f2-ed8f53e01cac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>类别</th>\n",
       "      <th>文本</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>种族歧视</td>\n",
       "      <td>这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>政治敏感</td>\n",
       "      <td>刁进平</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>微侵犯(MA)</td>\n",
       "      <td>一直都没发现这条惊天大驴，直到突然怀崽才发现，一怀孕就开始歪屁股     举报       ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>色情</td>\n",
       "      <td>干的爽</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>种族歧视</td>\n",
       "      <td>黑人在美国侮辱华人，那我支持白人警察打击黑人有什么问题・_・?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25095</th>\n",
       "      <td>25095</td>\n",
       "      <td>种族歧视</td>\n",
       "      <td>实际上，如果你不是白种女孩的话是很容易被歧视的，如果你是黑人，</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25096</th>\n",
       "      <td>25096</td>\n",
       "      <td>政治敏感</td>\n",
       "      <td>大纪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25097</th>\n",
       "      <td>25097</td>\n",
       "      <td>色情</td>\n",
       "      <td>逃亡艳旅*</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25098</th>\n",
       "      <td>25098</td>\n",
       "      <td>基于文化背景的刻板印象(SCB)</td>\n",
       "      <td>并且总感觉是举着帮助女人学会经营情感的大旗干着教女人怎么讨好迎合男人的勾当</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25099</th>\n",
       "      <td>25099</td>\n",
       "      <td>政治敏感</td>\n",
       "      <td>回民猪布什</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>25100 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          id                类别  \\\n",
       "0          0              种族歧视   \n",
       "1          1              政治敏感   \n",
       "2          2           微侵犯(MA)   \n",
       "3          3                色情   \n",
       "4          4              种族歧视   \n",
       "...      ...               ...   \n",
       "25095  25095              种族歧视   \n",
       "25096  25096              政治敏感   \n",
       "25097  25097                色情   \n",
       "25098  25098  基于文化背景的刻板印象(SCB)   \n",
       "25099  25099              政治敏感   \n",
       "\n",
       "                                                      文本  \n",
       "0      这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里...  \n",
       "1                                                    刁进平  \n",
       "2      一直都没发现这条惊天大驴，直到突然怀崽才发现，一怀孕就开始歪屁股     举报       ...  \n",
       "3                                                    干的爽  \n",
       "4                        黑人在美国侮辱华人，那我支持白人警察打击黑人有什么问题・_・?  \n",
       "...                                                  ...  \n",
       "25095                    实际上，如果你不是白种女孩的话是很容易被歧视的，如果你是黑人，  \n",
       "25096                                                 大纪  \n",
       "25097                                              逃亡艳旅*  \n",
       "25098              并且总感觉是举着帮助女人学会经营情感的大旗干着教女人怎么讨好迎合男人的勾当  \n",
       "25099                                              回民猪布什  \n",
       "\n",
       "[25100 rows x 3 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ab05211-346b-447b-b345-50a2a615df9a",
   "metadata": {},
   "source": [
    "## TFIDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cd6a122a-02e8-42ee-ba04-12b01a9018e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidf = TfidfVectorizer(tokenizer=jieba.lcut)\n",
    "train_tfidf = tfidf.fit_transform(train[\"文本\"])\n",
    "test_tfidf = tfidf.transform(test[\"文本\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ff5f442e-20ea-4cc7-a64d-fee2329e9603",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5846150237635646"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pred = cross_val_predict(\n",
    "    LinearSVC(),\n",
    "    train_tfidf,\n",
    "    train[\"类别\"]\n",
    ")\n",
    "f1_score(train[\"类别\"], pred, average=\"macro\")"
   ]
  },
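  {
   "cell_type": "markdown",
   "id": "3d9c5e4f-af63-4ab4-c7e5-2f3a4b5c6d05",
   "metadata": {},
   "source": [
    "The background section notes that banned terms mutate through homophones and symbol insertion, which word-level tokens can miss. A hedged, untested variant worth trying is character n-gram TF-IDF; the `char_wb` analyzer and n-gram range below are guesses, not tuned values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d9c5e4f-af63-4ab4-c7e5-2f3a4b5c6d06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Character n-grams skip word segmentation entirely and are often more\n",
    "# robust to symbol insertion and spelling variants than jieba tokens\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.model_selection import cross_val_predict\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "tfidf_char = TfidfVectorizer(analyzer=\"char_wb\", ngram_range=(1, 3))\n",
    "train_tfidf_char = tfidf_char.fit_transform(train[\"文本\"])\n",
    "pred_char = cross_val_predict(LinearSVC(), train_tfidf_char, train[\"类别\"])\n",
    "f1_score(train[\"类别\"], pred_char, average=\"macro\")"
   ]
  },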
  {
   "cell_type": "markdown",
   "id": "96fbbe26-2c01-42c3-82ff-9f501804b7e5",
   "metadata": {},
   "source": [
    "## BERT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "99a3550b-118a-4449-8a6e-31bb5b153083",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.utils.data import Dataset, DataLoader, TensorDataset\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import random\n",
    "import re\n",
    "\n",
    "x_train, x_test, train_label, test_label =  train_test_split(train[\"文本\"].values, \n",
    "                                                             train[\"类别\"].values, \n",
    "                                                             test_size=0.2, \n",
    "                                                             stratify=train[\"类别\"].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "78286291-50fb-4c98-8897-809e9953e08b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import BertTokenizer\n",
    "# 分词器，词典\n",
    "\n",
    "tokenizer = BertTokenizer.from_pretrained('/home/lyz/hf-models/bert-base-chinese')\n",
    "train_encoding = tokenizer(list(x_train), truncation=True, padding=True, max_length=64)\n",
    "test_encoding = tokenizer(list(x_test), truncation=True, padding=True, max_length=64)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "07a74144-573d-470d-8102-09b26b823f9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 数据集读取\n",
    "class NewsDataset(Dataset):\n",
    "    def __init__(self, encodings, labels):\n",
    "        self.encodings = encodings\n",
    "        self.labels = labels\n",
    "    \n",
    "    # 读取单个样本\n",
    "    def __getitem__(self, idx):\n",
    "        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}\n",
    "        item['labels'] = torch.tensor(int(category_list.index(self.labels[idx])))\n",
    "        return item\n",
    "    \n",
    "    def __len__(self):\n",
    "        return len(self.labels)\n",
    "\n",
    "train_dataset = NewsDataset(train_encoding, train_label)\n",
    "test_dataset = NewsDataset(test_encoding, test_label)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "9888b8c3-fb0f-4881-95a5-098fcac70bb6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([ 101, 1762,  704, 1066, 2190, 1920, 3791, 1355, 1220, 3655, 6999, 6833,\n",
       "         2154, 1400, 8024, 3300,  782, 2218, 3221, 1728,  711,  704, 1066, 4638,\n",
       "         6439, 6470, 2798, 1343,  749, 6237, 1920, 3791, 8024,  794, 5445, 2533,\n",
       "         3791, 4638, 8024, 3300, 4638, 3221,  711,  749, 1091, 2821, 1161, 1920,\n",
       "         3791, 4638, 3152, 4995, 3341,  749, 6237, 1920, 3791, 8024, 3297, 1400,\n",
       "         3152, 4995, 3766,  102]),\n",
       " 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),\n",
       " 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),\n",
       " 'labels': tensor(4)}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_dataset[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "ac7a3425-0902-4e4b-96bf-d7bfbf2f727b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /home/lyz/hf-models/bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "# 精度计算\n",
    "def flat_accuracy(preds, labels):\n",
    "    pred_flat = np.argmax(preds, axis=1).flatten()\n",
    "    labels_flat = labels.flatten()\n",
    "    return np.sum(pred_flat == labels_flat) / len(labels_flat)\n",
    "\n",
    "from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup\n",
    "model = BertForSequenceClassification.from_pretrained('/home/lyz/hf-models/bert-base-chinese', num_labels=len(category_list))\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "model.to(device)\n",
    "\n",
    "# 单个读取到批量读取\n",
    "train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n",
    "test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=True)\n",
    "\n",
    "# 优化方法\n",
    "optim = torch.optim.AdamW(model.parameters(), lr=2e-5)"
   ]
  },
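  {
   "cell_type": "markdown",
   "id": "4eadf650-b074-4bc5-d8f6-3a4b5c6d7e07",
   "metadata": {},
   "source": [
    "Note that `get_linear_schedule_with_warmup` is imported above but never attached. A minimal sketch of wiring it in, assuming the `train_loader` and `optim` defined above; the epoch count and warmup fraction are assumptions, and `scheduler.step()` would go right after `optim.step()` in the training loop below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4eadf650-b074-4bc5-d8f6-3a4b5c6d7e08",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: linear warmup + linear decay schedule; assumes the train_loader\n",
    "# and optim defined above. Call scheduler.step() after each optim.step().\n",
    "num_epochs = 2  # assumed; keep in sync with the training loop\n",
    "total_steps = len(train_loader) * num_epochs\n",
    "scheduler = get_linear_schedule_with_warmup(\n",
    "    optim,\n",
    "    num_warmup_steps=int(0.1 * total_steps),  # warm up over the first 10%\n",
    "    num_training_steps=total_steps\n",
    ")"
   ]
  },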
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "5db68935-4ba9-42a7-a969-104e216181fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------Epoch: 0 ----------------\n",
      "epoth: 0, iter_num: 100, loss: 0.4526, 7.97%\n",
      "epoth: 0, iter_num: 200, loss: 0.3962, 15.94%\n",
      "epoth: 0, iter_num: 300, loss: 0.1823, 23.90%\n",
      "epoth: 0, iter_num: 400, loss: 0.2082, 31.87%\n",
      "epoth: 0, iter_num: 500, loss: 0.2768, 39.84%\n",
      "epoth: 0, iter_num: 600, loss: 0.0532, 47.81%\n",
      "epoth: 0, iter_num: 700, loss: 0.8204, 55.78%\n",
      "epoth: 0, iter_num: 800, loss: 0.6132, 63.75%\n",
      "epoth: 0, iter_num: 900, loss: 0.1600, 71.71%\n",
      "epoth: 0, iter_num: 1000, loss: 0.5317, 79.68%\n",
      "epoth: 0, iter_num: 1100, loss: 0.8300, 87.65%\n",
      "epoth: 0, iter_num: 1200, loss: 0.3585, 95.62%\n",
      "Epoch: 0, Average training loss: 0.3998\n",
      "Accuracy: 0.9145\n",
      "Average testing loss: 0.2834\n",
      "-------------------------------\n",
      "------------Epoch: 1 ----------------\n",
      "epoth: 1, iter_num: 100, loss: 0.1693, 7.97%\n",
      "epoth: 1, iter_num: 200, loss: 0.1318, 15.94%\n",
      "epoth: 1, iter_num: 300, loss: 0.1054, 23.90%\n",
      "epoth: 1, iter_num: 400, loss: 0.4504, 31.87%\n",
      "epoth: 1, iter_num: 500, loss: 0.0180, 39.84%\n",
      "epoth: 1, iter_num: 600, loss: 0.0161, 47.81%\n",
      "epoth: 1, iter_num: 700, loss: 0.2314, 55.78%\n",
      "epoth: 1, iter_num: 800, loss: 0.1189, 63.75%\n",
      "epoth: 1, iter_num: 900, loss: 0.0610, 71.71%\n",
      "epoth: 1, iter_num: 1000, loss: 0.0495, 79.68%\n",
      "epoth: 1, iter_num: 1100, loss: 0.6081, 87.65%\n",
      "epoth: 1, iter_num: 1200, loss: 0.0242, 95.62%\n",
      "Epoch: 1, Average training loss: 0.2358\n",
      "Accuracy: 0.9168\n",
      "Average testing loss: 0.2740\n",
      "-------------------------------\n",
      "------------Epoch: 2 ----------------\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[20], line 60\u001b[0m\n\u001b[1;32m     58\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m epoch \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(\u001b[38;5;241m4\u001b[39m):\n\u001b[1;32m     59\u001b[0m     \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m------------Epoch: \u001b[39m\u001b[38;5;132;01m%d\u001b[39;00m\u001b[38;5;124m ----------------\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m epoch)\n\u001b[0;32m---> 60\u001b[0m     \u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m     61\u001b[0m     validation()\n",
      "Cell \u001b[0;32mIn[20], line 11\u001b[0m, in \u001b[0;36mtrain\u001b[0;34m()\u001b[0m\n\u001b[1;32m      7\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m batch \u001b[38;5;129;01min\u001b[39;00m train_loader:\n\u001b[1;32m      8\u001b[0m     \u001b[38;5;66;03m# 正向传播\u001b[39;00m\n\u001b[1;32m      9\u001b[0m     optim\u001b[38;5;241m.\u001b[39mzero_grad()\n\u001b[0;32m---> 11\u001b[0m     input_ids \u001b[38;5;241m=\u001b[39m \u001b[43mbatch\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43minput_ids\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m     12\u001b[0m     attention_mask \u001b[38;5;241m=\u001b[39m batch[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mattention_mask\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mto(device)\n\u001b[1;32m     13\u001b[0m     labels \u001b[38;5;241m=\u001b[39m batch[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlabels\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mto(device)\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "# 训练函数\n",
    "def train():\n",
    "    model.train()\n",
    "    total_train_loss = 0\n",
    "    iter_num = 0\n",
    "    total_iter = len(train_loader)\n",
    "    for batch in train_loader:\n",
    "        # 正向传播\n",
    "        optim.zero_grad()\n",
    "        \n",
    "        input_ids = batch['input_ids'].to(device)\n",
    "        attention_mask = batch['attention_mask'].to(device)\n",
    "        labels = batch['labels'].to(device)\n",
    "        \n",
    "        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n",
    "        loss = outputs[0]\n",
    "        total_train_loss += loss.item()\n",
    "        \n",
    "        # 反向梯度信息\n",
    "        loss.backward()\n",
    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
    "        \n",
    "        # 参数更新\n",
    "        optim.step()\n",
    "\n",
    "        iter_num += 1\n",
    "        if(iter_num % 100==0):\n",
    "            print(\"epoth: %d, iter_num: %d, loss: %.4f, %.2f%%\" % (epoch, iter_num, loss.item(), iter_num/total_iter*100))\n",
    "        \n",
    "    print(\"Epoch: %d, Average training loss: %.4f\"%(epoch, total_train_loss/len(train_loader)))\n",
    "    \n",
    "def validation():\n",
    "    model.eval()\n",
    "    total_eval_accuracy = 0\n",
    "    total_eval_loss = 0\n",
    "    for batch in test_dataloader:\n",
    "        with torch.no_grad():\n",
    "            # 正常传播\n",
    "            input_ids = batch['input_ids'].to(device)\n",
    "            attention_mask = batch['attention_mask'].to(device)\n",
    "            labels = batch['labels'].to(device)\n",
    "            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n",
    "        \n",
    "        loss = outputs[0]\n",
    "        logits = outputs[1]\n",
    "\n",
    "        total_eval_loss += loss.item()\n",
    "        logits = logits.detach().cpu().numpy()\n",
    "        label_ids = labels.to('cpu').numpy()\n",
    "        total_eval_accuracy += flat_accuracy(logits, label_ids)\n",
    "        \n",
    "    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)\n",
    "    print(\"Accuracy: %.4f\" % (avg_val_accuracy))\n",
    "    print(\"Average testing loss: %.4f\"%(total_eval_loss/len(test_dataloader)))\n",
    "    print(\"-------------------------------\")\n",
    "    \n",
    "\n",
    "for epoch in range(2):\n",
    "    print(\"------------Epoch: %d ----------------\" % epoch)\n",
    "    train()\n",
    "    validation()"
   ]
  },
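  {
   "cell_type": "markdown",
   "id": "5fbe0761-c185-4cd6-e907-4b5c6d7e8f09",
   "metadata": {},
   "source": [
    "The competition scores macro F1 rather than accuracy, so it can be worth scoring the held-out split with the same metric. A hedged sketch reusing `model`, `test_dataloader`, and `device` from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fbe0761-c185-4cd6-e907-4b5c6d7e8f10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# Score the held-out split with the competition metric (macro F1)\n",
    "model.eval()\n",
    "all_preds, all_labels = [], []\n",
    "for batch in test_dataloader:\n",
    "    with torch.no_grad():\n",
    "        outputs = model(batch['input_ids'].to(device),\n",
    "                        attention_mask=batch['attention_mask'].to(device))\n",
    "    all_preds += outputs.logits.argmax(dim=1).cpu().tolist()\n",
    "    all_labels += batch['labels'].tolist()\n",
    "f1_score(all_labels, all_preds, average=\"macro\")"
   ]
  },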
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "fb0cf3f6-6d46-4287-8abd-af4b788f5f05",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_encoding = tokenizer(list(test[\"文本\"]), truncation=True, padding=True, max_length=64)\n",
    "test_dataset = NewsDataset(test_encoding, ['种族歧视'] * len(test[\"文本\"]))\n",
    "test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "734ca7d7-9dd8-4487-a09a-cb55948438d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "label = []\n",
    "for batch in test_dataloader:\n",
    "    with torch.no_grad():\n",
    "        # 正常传播\n",
    "        input_ids = batch['input_ids'].to(device)\n",
    "        attention_mask = batch['attention_mask'].to(device)\n",
    "        labels = batch['labels'].to(device)\n",
    "        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n",
    "    pred = outputs.logits.data.cpu().numpy().argmax(1)\n",
    "    label += [category_list[x] for x in pred]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "f7dab83a-4387-4a8c-af01-2610701bebb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "test[\"类别\"] = label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "71291408-77f6-499f-b220-aba4ca92805e",
   "metadata": {},
   "outputs": [],
   "source": [
    "test[[\"id\", \"类别\"]].to_csv(\"submit_bert.csv\", index=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "642c5f27-fa17-49cc-9c4d-fad6a4a21589",
   "metadata": {},
   "source": [
    "## Qwen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "793b56ec-eae5-4f1e-ad1e-71e4872eef7b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-06-30 21:08:48.798987: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\" \n",
    "\n",
    "from datasets import Dataset\n",
    "import pandas as pd\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, GenerationConfig\n",
    "\n",
    "train_data = pd.read_csv(\"dataset/train_all.csv\")\n",
    "test = pd.read_csv(\"dataset/test_text.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6b38eb29-ceeb-4a90-9e8f-f28f4e3d33b4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': [0, 1, 2],\n",
       " 'output': ['种族歧视', '政治敏感', '微侵犯(MA)'],\n",
       " 'instruction': ['这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里的坏人就必须是官二代城市人的身份咯？少民肯定也不行咯？女人LGBT黑人宗教人士肯定也都不行咯？ 一群白（黄？）左在下面跟着高潮也是醉了',\n",
       "  '刁进平',\n",
       "  '一直都没发现这条惊天大驴，直到突然怀崽才发现，一怀孕就开始歪屁股    \\xa0举报    \\xa0        赞[12]        \\xa0回复        \\xa0    05月12日 00:29\\xa0来自网页'],\n",
       " 'input': ['', '', '']}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data[\"input\"] = \"\"\n",
    "train_data.columns = [\"id\",\"output\", \"instruction\", \"input\"]\n",
    "ds = Dataset.from_pandas(train_data)\n",
    "ds[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "21c93f70-f2de-400c-9ef3-4c2b61df6d96",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Qwen2Tokenizer(name_or_path='/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/', vocab_size=151643, model_max_length=32768, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={\n",
       "\t151643: AddedToken(\"<|endoftext|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "\t151644: AddedToken(\"<|im_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "\t151645: AddedToken(\"<|im_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "}\n",
       ")"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = AutoTokenizer.from_pretrained(\"/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/\", use_fast=False, trust_remote_code=True)\n",
    "tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0220af3f-b3de-48fa-9ed9-8d2ce30676a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_func(example):\n",
    "    MAX_LENGTH = 128    # Llama分词器会将一个中文字切分为多个token，因此需要放开一些最大长度，保证数据的完整性\n",
    "    input_ids, attention_mask, labels = [], [], []\n",
    "    instruction = tokenizer(f\"<|im_start|>system\\n违禁文本分类<|im_end|>\\n<|im_start|>user\\n{example['instruction'] + example['input']}<|im_end|>\\n<|im_start|>assistant\\n\", add_special_tokens=False)  # add_special_tokens 不在开头加 special_tokens\n",
    "    response = tokenizer(f\"{example['output']}\", add_special_tokens=False)\n",
    "    input_ids = instruction[\"input_ids\"] + response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
    "    attention_mask = instruction[\"attention_mask\"] + response[\"attention_mask\"] + [1]  # 因为eos token咱们也是要关注的所以 补充为1\n",
    "    labels = [-100] * len(instruction[\"input_ids\"]) + response[\"input_ids\"] + [tokenizer.pad_token_id]  \n",
    "    if len(input_ids) > MAX_LENGTH:  # 做一个截断\n",
    "        input_ids = input_ids[:MAX_LENGTH]\n",
    "        attention_mask = attention_mask[:MAX_LENGTH]\n",
    "        labels = labels[:MAX_LENGTH]\n",
    "    return {\n",
    "        \"input_ids\": input_ids,\n",
    "        \"attention_mask\": attention_mask,\n",
    "        \"labels\": labels\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84c2c47e-e68e-47d6-9f6e-8cc3566da40b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d1e43edbd296484f81c3b770dcc96d08",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/25100 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'<|im_start|>system\\n违禁文本分类<|im_end|>\\n<|im_start|>user\\n这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里的坏人就必须是官二代城市人的身份咯？少民肯定也不行咯？女人LGBT黑人宗教人士肯定也都不行咯？ 一群白（黄？）左在下面跟着高潮也是醉了<|im_end|>\\n<|im_start|>assistant\\n种族歧视<|endoftext|>'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized_id = ds.map(process_func, remove_columns=ds.column_names)\n",
    "tokenizer.decode(tokenized_id[0]['input_ids'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "bfb4749a-16b2-4241-9915-67161180a75d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'政治敏感<|endoftext|>'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.decode(list(filter(lambda x: x != -100, tokenized_id[1][\"labels\"])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "74fa8a5f-d638-48d8-baed-231971b117ea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b14e734b387845a7b4cae9fb0e4ff290",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/24100 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c782dc2bfa3c4011806d06cb2ac9d325",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_ds = Dataset.from_pandas(train_data.iloc[:-1000])\n",
    "eval_ds = Dataset.from_pandas(train_data[-1000:])\n",
    "\n",
    "train_tokenized_id = train_ds.map(process_func, remove_columns=ds.column_names)\n",
    "eval_tokenized_id = eval_ds.map(process_func, remove_columns=ds.column_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "56a614a2-125e-49db-af92-5497a844ca67",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(\"/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/\", device_map=\"auto\")\n",
    "model.enable_input_require_grads() # 开启梯度检查点时，要执行该方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "90a165b4-3fad-47b1-b5af-8f528b26c803",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trainable params: 7,495,680 || all params: 1,844,324,352 || trainable%: 0.4064187512284173\n"
     ]
    }
   ],
   "source": [
    "from peft import LoraConfig, TaskType, get_peft_model\n",
    "\n",
    "config = LoraConfig(\n",
    "    task_type=TaskType.CAUSAL_LM, \n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    inference_mode=False, # 训练模式\n",
    "    r=8, # Lora 秩\n",
    "    lora_alpha=32, # Lora alaph，具体作用参见 Lora 原理\n",
    "    lora_dropout=0.1# Dropout 比例\n",
    ")\n",
    "\n",
    "model = get_peft_model(model, config)\n",
    "model.print_trainable_parameters()"
   ]
  },
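  {
   "cell_type": "markdown",
   "id": "6acf1872-d296-4de7-fa18-5c6d7e8f9a11",
   "metadata": {},
   "source": [
    "For reference, LoRA keeps the pretrained weight $W$ frozen and learns a low-rank update scaled by $\\alpha / r$:\n",
    "\n",
    "$$W' = W + \\frac{\\alpha}{r} B A, \\qquad B \\in \\mathbb{R}^{d \\times r},\\; A \\in \\mathbb{R}^{r \\times k}$$\n",
    "\n",
    "With `r=8` and `lora_alpha=32` above, the update is scaled by a factor of 4."
   ]
  },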
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cf7a0b8a-bd9c-4c3e-a467-b89888594aba",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/lyz/anaconda3/envs/py311/lib/python3.11/site-packages/transformers/training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead\n",
      "  warnings.warn(\n",
      "No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.\n",
      "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.\n",
      "/home/lyz/anaconda3/envs/py311/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.\n",
      "  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='6' max='5020' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [   6/5020 00:11 < 3:55:58, 0.35 it/s, Epoch 0.00/5]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Step</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "ename": "OutOfMemoryError",
     "evalue": "CUDA out of memory. Tried to allocate 1.24 GiB. GPU 0 has a total capacity of 10.90 GiB of which 508.44 MiB is free. Including non-PyTorch memory, this process has 10.40 GiB memory in use. Of the allocated memory 9.84 GiB is allocated by PyTorch, and 403.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mOutOfMemoryError\u001b[0m                          Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[10], line 23\u001b[0m\n\u001b[1;32m      1\u001b[0m args \u001b[38;5;241m=\u001b[39m TrainingArguments(\n\u001b[1;32m      2\u001b[0m     output_dir\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m./output_Qwen1.5\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m      3\u001b[0m     per_device_train_batch_size\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m6\u001b[39m,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m     14\u001b[0m     load_best_model_at_end\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m     15\u001b[0m )\n\u001b[1;32m     16\u001b[0m trainer \u001b[38;5;241m=\u001b[39m Trainer(\n\u001b[1;32m     17\u001b[0m     model\u001b[38;5;241m=\u001b[39mmodel,\n\u001b[1;32m     18\u001b[0m     args\u001b[38;5;241m=\u001b[39margs,\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m     21\u001b[0m     data_collator\u001b[38;5;241m=\u001b[39mDataCollatorForSeq2Seq(tokenizer\u001b[38;5;241m=\u001b[39mtokenizer, padding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m),\n\u001b[1;32m     22\u001b[0m )\n\u001b[0;32m---> 23\u001b[0m \u001b[43mtrainer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtrain\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py:2245\u001b[0m, in \u001b[0;36mTrainer.train\u001b[0;34m(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)\u001b[0m\n\u001b[1;32m   2243\u001b[0m         hf_hub_utils\u001b[38;5;241m.\u001b[39menable_progress_bars()\n\u001b[1;32m   2244\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 2245\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43minner_training_loop\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m   2246\u001b[0m \u001b[43m        \u001b[49m\u001b[43margs\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m   2247\u001b[0m \u001b[43m        \u001b[49m\u001b[43mresume_from_checkpoint\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_from_checkpoint\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m   2248\u001b[0m \u001b[43m        \u001b[49m\u001b[43mtrial\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtrial\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m   2249\u001b[0m \u001b[43m        \u001b[49m\u001b[43mignore_keys_for_eval\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mignore_keys_for_eval\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m   2250\u001b[0m \u001b[43m    \u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py:2556\u001b[0m, in \u001b[0;36mTrainer._inner_training_loop\u001b[0;34m(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)\u001b[0m\n\u001b[1;32m   2549\u001b[0m context \u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m   2550\u001b[0m     functools\u001b[38;5;241m.\u001b[39mpartial(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maccelerator\u001b[38;5;241m.\u001b[39mno_sync, model\u001b[38;5;241m=\u001b[39mmodel)\n\u001b[1;32m   2551\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m i \u001b[38;5;241m!=\u001b[39m \u001b[38;5;28mlen\u001b[39m(batch_samples) \u001b[38;5;241m-\u001b[39m \u001b[38;5;241m1\u001b[39m\n\u001b[1;32m   2552\u001b[0m     \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maccelerator\u001b[38;5;241m.\u001b[39mdistributed_type \u001b[38;5;241m!=\u001b[39m DistributedType\u001b[38;5;241m.\u001b[39mDEEPSPEED\n\u001b[1;32m   2553\u001b[0m     \u001b[38;5;28;01melse\u001b[39;00m contextlib\u001b[38;5;241m.\u001b[39mnullcontext\n\u001b[1;32m   2554\u001b[0m )\n\u001b[1;32m   2555\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m context():\n\u001b[0;32m-> 2556\u001b[0m     tr_loss_step \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtraining_step\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mnum_items_in_batch\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m   2558\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[1;32m   2559\u001b[0m     args\u001b[38;5;241m.\u001b[39mlogging_nan_inf_filter\n\u001b[1;32m   2560\u001b[0m     \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_torch_xla_available()\n\u001b[1;32m   2561\u001b[0m     \u001b[38;5;129;01mand\u001b[39;00m (torch\u001b[38;5;241m.\u001b[39misnan(tr_loss_step) \u001b[38;5;129;01mor\u001b[39;00m torch\u001b[38;5;241m.\u001b[39misinf(tr_loss_step))\n\u001b[1;32m   2562\u001b[0m ):\n\u001b[1;32m   2563\u001b[0m     \u001b[38;5;66;03m# if loss is nan or inf simply add the average of previous logged losses\u001b[39;00m\n\u001b[1;32m   2564\u001b[0m     tr_loss \u001b[38;5;241m=\u001b[39m tr_loss \u001b[38;5;241m+\u001b[39m tr_loss \u001b[38;5;241m/\u001b[39m (\u001b[38;5;241m1\u001b[39m \u001b[38;5;241m+\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mstate\u001b[38;5;241m.\u001b[39mglobal_step \u001b[38;5;241m-\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_globalstep_last_logged)\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/transformers/trainer.py:3764\u001b[0m, in \u001b[0;36mTrainer.training_step\u001b[0;34m(***failed resolving arguments***)\u001b[0m\n\u001b[1;32m   3761\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maccelerator\u001b[38;5;241m.\u001b[39mdistributed_type \u001b[38;5;241m==\u001b[39m DistributedType\u001b[38;5;241m.\u001b[39mDEEPSPEED:\n\u001b[1;32m   3762\u001b[0m     kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mscale_wrt_gas\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[0;32m-> 3764\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43maccelerator\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbackward\u001b[49m\u001b[43m(\u001b[49m\u001b[43mloss\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m   3766\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m loss\u001b[38;5;241m.\u001b[39mdetach()\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/accelerate/accelerator.py:2553\u001b[0m, in \u001b[0;36mAccelerator.backward\u001b[0;34m(self, loss, **kwargs)\u001b[0m\n\u001b[1;32m   2551\u001b[0m     \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlomo_backward(loss, learning_rate)\n\u001b[1;32m   2552\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m-> 2553\u001b[0m     \u001b[43mloss\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbackward\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/torch/_tensor.py:521\u001b[0m, in \u001b[0;36mTensor.backward\u001b[0;34m(self, gradient, retain_graph, create_graph, inputs)\u001b[0m\n\u001b[1;32m    511\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m has_torch_function_unary(\u001b[38;5;28mself\u001b[39m):\n\u001b[1;32m    512\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m handle_torch_function(\n\u001b[1;32m    513\u001b[0m         Tensor\u001b[38;5;241m.\u001b[39mbackward,\n\u001b[1;32m    514\u001b[0m         (\u001b[38;5;28mself\u001b[39m,),\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    519\u001b[0m         inputs\u001b[38;5;241m=\u001b[39minputs,\n\u001b[1;32m    520\u001b[0m     )\n\u001b[0;32m--> 521\u001b[0m \u001b[43mtorch\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mautograd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mbackward\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m    522\u001b[0m \u001b[43m    \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mgradient\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mretain_graph\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcreate_graph\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43minputs\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minputs\u001b[49m\n\u001b[1;32m    523\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/torch/autograd/__init__.py:289\u001b[0m, in \u001b[0;36mbackward\u001b[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)\u001b[0m\n\u001b[1;32m    284\u001b[0m     retain_graph \u001b[38;5;241m=\u001b[39m create_graph\n\u001b[1;32m    286\u001b[0m \u001b[38;5;66;03m# The reason we repeat the same comment below is that\u001b[39;00m\n\u001b[1;32m    287\u001b[0m \u001b[38;5;66;03m# some Python versions print out the first line of a multi-line function\u001b[39;00m\n\u001b[1;32m    288\u001b[0m \u001b[38;5;66;03m# calls in the traceback and some print out the last line\u001b[39;00m\n\u001b[0;32m--> 289\u001b[0m \u001b[43m_engine_run_backward\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m    290\u001b[0m \u001b[43m    \u001b[49m\u001b[43mtensors\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    291\u001b[0m \u001b[43m    \u001b[49m\u001b[43mgrad_tensors_\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    292\u001b[0m \u001b[43m    \u001b[49m\u001b[43mretain_graph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    293\u001b[0m \u001b[43m    \u001b[49m\u001b[43mcreate_graph\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    294\u001b[0m \u001b[43m    \u001b[49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m    295\u001b[0m \u001b[43m    \u001b[49m\u001b[43mallow_unreachable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m    296\u001b[0m \u001b[43m    \u001b[49m\u001b[43maccumulate_grad\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m    297\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
      "File \u001b[0;32m~/anaconda3/envs/py311/lib/python3.11/site-packages/torch/autograd/graph.py:768\u001b[0m, in \u001b[0;36m_engine_run_backward\u001b[0;34m(t_outputs, *args, **kwargs)\u001b[0m\n\u001b[1;32m    766\u001b[0m     unregister_hooks \u001b[38;5;241m=\u001b[39m _register_logging_hooks_on_whole_graph(t_outputs)\n\u001b[1;32m    767\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 768\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mVariable\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_execution_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mrun_backward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m  \u001b[49m\u001b[38;5;66;43;03m# Calls into the C++ engine to run the backward pass\u001b[39;49;00m\n\u001b[1;32m    769\u001b[0m \u001b[43m        \u001b[49m\u001b[43mt_outputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\n\u001b[1;32m    770\u001b[0m \u001b[43m    \u001b[49m\u001b[43m)\u001b[49m  \u001b[38;5;66;03m# Calls into the C++ engine to run the backward pass\u001b[39;00m\n\u001b[1;32m    771\u001b[0m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[1;32m    772\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m attach_logging_hooks:\n",
      "\u001b[0;31mOutOfMemoryError\u001b[0m: CUDA out of memory. Tried to allocate 1.24 GiB. GPU 0 has a total capacity of 10.90 GiB of which 508.44 MiB is free. Including non-PyTorch memory, this process has 10.40 GiB memory in use. Of the allocated memory 9.84 GiB is allocated by PyTorch, and 403.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
     ]
    }
   ],
   "source": [
    "args = TrainingArguments(\n",
    "    output_dir=\"./output_Qwen1.5\",\n",
    "    per_device_train_batch_size=6,\n",
    "    gradient_accumulation_steps=4,\n",
    "    logging_steps=100,\n",
    "    do_eval=True,\n",
    "    evaluation_strategy=\"steps\",\n",
    "    eval_steps=50,\n",
    "    num_train_epochs=5,\n",
    "    save_steps=50,\n",
    "    learning_rate=1e-4,\n",
    "    save_on_each_node=True,\n",
    "    gradient_checkpointing=True,\n",
    "    load_best_model_at_end=True\n",
    ")\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=args,\n",
    "    train_dataset=train_tokenized_id,\n",
    "    eval_dataset=eval_tokenized_id,\n",
    "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
    ")\n",
    "trainer.train()"
   ]
  },
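  {
   "cell_type": "markdown",
   "id": "7bd02983-e3a7-4ef8-0b29-6d7e8f9a0b12",
   "metadata": {},
   "source": [
    "The run above dies with a CUDA OOM on an 11 GiB GPU. A hedged, untested sketch of lower-memory settings: a smaller per-device batch with more accumulation steps (keeping the effective batch at 24), fp16, and the allocator hint quoted in the error message. The exact numbers are guesses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bd02983-e3a7-4ef8-0b29-6d7e8f9a0b13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "# Allocator hint from the OOM message; must be set before CUDA is initialized,\n",
    "# so in practice set it at the top of the notebook or in the shell\n",
    "os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"expandable_segments:True\"\n",
    "\n",
    "args = TrainingArguments(\n",
    "    output_dir=\"./output_Qwen1.5\",\n",
    "    per_device_train_batch_size=2,   # was 6; 2 x 12 keeps the effective batch at 24\n",
    "    gradient_accumulation_steps=12,\n",
    "    fp16=True,                       # roughly halves activation memory\n",
    "    logging_steps=100,\n",
    "    do_eval=True,\n",
    "    eval_strategy=\"steps\",           # non-deprecated spelling of evaluation_strategy\n",
    "    eval_steps=50,\n",
    "    num_train_epochs=5,\n",
    "    save_steps=50,\n",
    "    learning_rate=1e-4,\n",
    "    gradient_checkpointing=True,\n",
    "    load_best_model_at_end=True\n",
    ")"
   ]
  },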
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3c51ae6-d5fc-45d1-b320-bd97dcf79eb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm_notebook\n",
    "pred_label = []\n",
    "for train_text in tqdm_notebook(test[\"文本\"].values):\n",
    "    prompt = f'''{train_text}'''\n",
    "    messages = [\n",
    "    {\"role\": \"system\", \"content\": \"现在进行意图分类任务\"},\n",
    "        {\"role\": \"user\", \"content\": prompt}\n",
    "    ]\n",
    "    text = tokenizer.apply_chat_template(\n",
    "        messages,\n",
    "        tokenize=False,\n",
    "        add_generation_prompt=True\n",
    "    )\n",
    "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)\n",
    "    \n",
    "    generated_ids = model.generate(\n",
    "        model_inputs.input_ids,\n",
    "        max_new_tokens=512,\n",
    "        \n",
    "    )\n",
    "    generated_ids = [\n",
    "        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n",
    "    ]\n",
    "    \n",
    "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
    "    pred_label += [response]"
   ]
  }
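,
  {
   "cell_type": "markdown",
   "id": "8ce13a94-f4b8-4f09-1c30-7e8f9a0b1c14",
   "metadata": {},
   "source": [
    "The generated responses are free-form text and may not exactly match a label from the training set. A hedged post-processing sketch that snaps each response to a known category, falling back to the majority class (the fallback choice and the output filename are assumptions):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ce13a94-f4b8-4f09-1c30-7e8f9a0b1c15",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Snap free-form generations onto the fixed label set; anything that\n",
    "# does not match a known category falls back to the majority class\n",
    "categories = set(train_data[\"output\"].unique())\n",
    "fallback = train_data[\"output\"].value_counts().idxmax()\n",
    "clean_label = [p if p in categories else fallback for p in pred_label]\n",
    "\n",
    "test[\"类别\"] = clean_label\n",
    "test[[\"id\", \"类别\"]].to_csv(\"submit_qwen.csv\", index=False)"
   ]
  }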
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py3.11",
   "language": "python",
   "name": "py3.11"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
